Method for automatically matching
graphic elements and phonetic elements

REFERENCE TO RELATED APPLICATION 

This application is a continuation of the PCT 
International Application No. PCT/FR2004/03278 filed 
December 17, 2004, which is based on the French 
Application No. 0314928 filed December 18, 2003. 

BACKGROUND OF THE INVENTION 

1 - Field of the Invention 

The present invention relates generally to the
automatic extraction of linguistic knowledge from a corpus
of transcriptions of graphic chains into phonetic chains.
It relates more particularly to the transcription of
typographic elements, such as the characters of a
predetermined language, into phonetic elements.

2 - Description of the Prior Art 

At present, each word of a language constitutes a
graphic chain that is transcribed phonetically into a
chain of phonemes by a phonetician. For any new word to
be added to a training corpus, the phonetician must
intervene to transcribe the new word phonetically. Thus
the training corpus furnishes only global
grapheme/phoneme transcriptions. For example, in the
global transcription "ruelle"/[ryel], the corpus
indicates that, globally, the graphic chain "ruelle" is
translated into a phonetic chain. However, it is not made
explicit that the typographic element "r" is
transcribed phonetically in some unitary way. Nor does the
global transcription indicate the syllables
or graphemes constituting the graphic chain or the
phonetic elements constituting the phonetic chain.

One or more phonetic chains associated with any
graphic chain can be determined from the known elementary
transcription of each typographic element by
character-by-character analysis of the graphic chain.
Error-correction systems find the phonetic transcriptions
useful for recognizing lexical errors in text entered on a
keyboard. There is therefore a need to extract more
refined elementary transcriptions from a raw
transcription.

OBJECT OF THE INVENTION 

The invention aims to derive automatically from raw 
transcriptions of graphic chains, for example words and 
family names, into phonetic chains, transcriptions of 
graphic elements, for example characters, into phonetic 
elements constituting the phonetic chains, in order to 
segment any graphic chain into graphemes and any phonetic 
chain into phonemes automatically. The elementary
transcriptions, graphic element by graphic element, i.e.
character by character, thereafter facilitate automatic
global transcription of any additional graphic chain added
to the corpus of graphic chains, in particular on the basis
of a concatenation of phonetic elements matching the
characters of the additional graphic chain on a one-to-one
basis.

SUMMARY OF THE INVENTION

Accordingly, a method of the invention automatically
matches graphic elements constituting given graphic chains
to phonetic elements constituting corresponding phonetic
chains, after initially entering global transcriptions of
the graphic chains into the phonetic chains into a database
accessible by a computer and after estimating and storing
in the database first probabilities of elementary
transcriptions of graphic elements into respective phonetic
elements. The method is characterized by the following
steps:

for each transcription of a given graphic chain
with M graphic elements into a corresponding phonetic
chain with N phonetic elements, determining by MxN
iterations second probabilities of MxN second
transcriptions of M graphic chains resulting from M
successive concatenations of 1 to M graphic elements into
N phonetic chains resulting from N successive
concatenations of 1 to N phonetic elements, each second
probability of a second transcription depending on a
previously estimated first probability relating to the last
graphic and phonetic elements of said second transcription
and depending on the highest of three respective second
probabilities determined by preceding iterations, M and N
being integers, and

establishing and storing a link between the last
elements of the graphic and phonetic chains of each
second transcription and the last elements of the graphic
and phonetic chains of the transcription relating to the
highest of the three respective second probabilities, in
order for links established in an MxN matrix relative to
the second probabilities to constitute a single path
between last and first pairs of graphic and phonetic
elements of the matrix, in order to segment the given
graphic chain into graphemes corresponding to respective
phonemes segmenting the corresponding phonetic chain and
to store the matches between the graphemes and phonemes
in the database, the number of graphic elements in a
grapheme being identical to the number of phonetic
elements in the corresponding phoneme, in order for any
new graphic chain to be transcribed automatically into a
phonetic chain segmented into phonemes by means of the
stored matches.

According to other features of the invention, the
respective first probability for the determination of a
second probability relating to a second transcription of
a graphic chain concatenating m graphic elements into a
phonetic chain concatenating n phonetic elements, with
1 < m ≤ M and 1 < n ≤ N, relates to the last elements in
the graphic chain with m graphic elements and the
phonetic chain with n phonetic elements. The three
respective second probabilities determined beforehand for
the second transcription of the graphic chain with m
graphic elements into the phonetic chain with n phonetic
elements preferably and respectively relate to a second
transcription of a graphic chain with m-1 graphic
elements into the phonetic chain with n phonetic
elements, a second transcription of the graphic chain
with m graphic elements into a phonetic chain with n-1
phonetic elements and a second transcription of the
graphic chain with m-1 graphic elements into the phonetic
chain with n-1 phonetic elements.

For example, from a corpus of global transcriptions such as
"ruelle"|[ryel], the invention phonetically transcribes the
graphic elements "r", "u", "e", "lle" into the respective
phonetic elements [r], [y], [e], [l].

The invention may be regarded as similar to a
process of syllabification which, by analysis, decomposes a
global transcription into elementary transcriptions and
locally matches grapheme/phoneme subtranscriptions. The
division into initial graphemes and phonemes and the
biunivocal matching of each graphic element to each
phonetic element of the divided phonemes is called
grapheme|phoneme alignment. In the above example, the
invention produces the following alignment:

"r"   "u"   "e"   "lle"
[r]   [y]   [e]   [l**].

The symbol * denotes a mute and meaningless phonetic
element.

BRIEF DESCRIPTION OF THE DRAWINGS 

Other features and advantages of the present
invention will become more clearly apparent on
reading the following description of preferred
embodiments of the invention, given by way of nonlimiting
examples and with reference to the corresponding appended
drawings, in which:

- FIG. 1 shows an algorithm of the main steps of
the automatic matching method of the invention; and

- FIG. 2 shows an algorithm of the substeps of a
step of the automatic matching method for determining
individual first probabilities.

DETAILED DESCRIPTION OF THE DRAWINGS 

As shown in FIG. 1, the method of the invention for
automatically matching graphic elements and phonetic
elements comprises main steps E1 to E11. For example,
those steps are for the most part implemented in the form
of software in a terminal, such as a personal computer or
a mobile terminal in a cellular radio communication network,
and linked in particular to a software system for
orthographic correction of lexical errors which is
insertable into a word processing system or a linguistic
practice system. The terminal contains or is able to access
a database of the type used in artificial intelligence. The
database stores a corpus C of initial global transcriptions.

Initially, in the step E1, the global
transcriptions (CG|CP) are constituted by pairs each
matching a graphic chain CG, such as a word in a
predetermined language or a family name, to a phonetic
chain CP. These transcriptions are determined and entered
by an expert in phonetics on a form displayed by the
computer. The corpus C matches a priori graphic chains CG,
each composed of one or more typographic elements
(characters), hereinafter called graphic elements g_i of an
alphabet G = {g_1, ..., g_I} with I elements in the
predetermined language, where 1 ≤ i ≤ I, to respective
phonetic chains CP, each composed of one or more phonetic
elements p_j of an alphabet P = {p_1, ..., p_J} with J
phonetic elements, where 1 ≤ j ≤ J and I ≠ J. However,
the segmentation of the chain CG into syllables or into
graphemes each comprising one or more graphic elements
and the segmentation of the chain CP into phonemes each
comprising one or more phonetic elements are unknown at
this stage.

The alphabets G and P typically comprise around 30 
elements. There are therefore a total of 30 x 30 = 900 
possible pairs of graphic elements and phonetic elements. 
In practice, the corpus C contains at least 100 000 
global transcriptions of typographic chains CG into 
phonetic chains CP, which protects the invention from 
coarse errors in the estimation of probabilities, as 
discussed below. 

In the step E2, first probabilities of elementary
transcription P(g_i|p_j), i.e. the probabilities that a
graphic element g_i matches the phonetic element p_j, are
first estimated and stored in the database with the corpus C
of global transcriptions.

The estimated values of the first probabilities should be
as close as possible to the respective maximum
probability values so that the method of the
invention, which operates by iterations, converges quickly
without settling on local maxima.

The concatenated nature of the global
transcriptions of the chains leads to the hypothesis of a
correlation between the rank r_g of the graphic elements
in a graphic chain CG and the rank r_p of the phonetic
elements in the corresponding phonetic chain CP. For
example, in the global transcription (beau|bo), it is
more probable that the graphic element b, given its
position at the beginning of the chain CG, translates to
the phonetic element [b] rather than to the phonetic element
[o] placed at the end of the corresponding chain CP. In this
example, the correlation of the ranks brings the graphic
elements b and e closer to the phonetic element [b] and the
graphic elements a and u closer to the phonetic element [o].

The algorithm for the initial estimation E2 of the
first probabilities P(g_i|p_j) comprises the following
substeps E21 to E27.

In the substep E21, I x J contingency numbers K_{g_i p_j},
respectively associated with the elementary
transcriptions (g_i|p_j) of a graphic element of the
alphabet G into a phonetic element of the alphabet P, are
set to zero. At the end of the step E2, the contingency
number K_{g_i p_j} is equal to the estimated number of times
that the graphic element g_i is transcribed into the phonetic
element p_j in the various global transcriptions of
typographic chains CG into phonetic chains CP included in
the corpus C.

For each chain transcription (CG|CP), as indicated
in the substep E22, the ranks of the graphic elements in
the chain CG and the ranks of the phonetic elements in
the chain CP are normalized as a function of the
respective lengths l_g and l_p of the chains CG and CP,
which may be different. In the substep E23, the rank r of
a phonetic element in the chain CP is derived from the
rank r_{g_i} of a graphic element g_i in the chain CG with
which the phonetic element of rank r will be associated,
in accordance with the following relationship:

$$r = \left\lfloor \frac{r_{g_i} \, l_p}{l_g} \right\rfloor$$

where the brackets denote the integer portion.

The number K_{g_i p_j} of contingencies associated with
the elementary transcription of the graphic element g_i
into the phonetic element p_j is then incremented by 1
only if the phonetic element p_j is situated at the
derived rank r in the chain CP, as indicated in the
substeps E24 and E25.

The substeps E22 to E25 are repeated for each
global transcription (CG|CP) of the corpus C, as
indicated in the substep E26. When all the global
transcriptions of the corpus have been processed, the
next substep E27 estimates all the first probabilities
P(g_i|p_j) of elementary transcription between the graphic
elements and the phonetic elements, in accordance with
the following relationship for each graphic element g_i:

$$P(g_i \mid p_j) = \frac{K_{g_i p_j}}{\sum_{j'=1}^{J} K_{g_i p_{j'}}}$$

after calculating the sum term in the denominator for the
graphic element g_i.
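
By way of illustration only, the substeps E21 to E27 may be sketched as follows in Python. The sketch assumes that each chain is a string in which every character is one element, that ranks are counted from zero so that the derived rank always falls inside the chain CP, and that no chain is empty. The function name estimate_first_probabilities and the dictionary layout are hypothetical and are not part of the method as described.

```python
from collections import defaultdict


def estimate_first_probabilities(corpus):
    """Estimate P(g|p) from (graphic_chain, phonetic_chain) pairs.

    Returns a dict mapping (g, p) to the estimated first probability.
    """
    # E21: contingency numbers K[g][p], all implicitly zero.
    counts = defaultdict(lambda: defaultdict(int))

    # E26: repeat E22 to E25 for every global transcription.
    for cg, cp in corpus:
        lg, lp = len(cg), len(cp)  # E22: lengths used to normalize ranks
        for rank_g, g in enumerate(cg):
            # E23: integer portion of r_g * l_p / l_g (zero-based ranks, assumed).
            rank_p = (rank_g * lp) // lg
            # E24-E25: count only the phonetic element at the derived rank.
            counts[g][cp[rank_p]] += 1

    # E27: normalize each row of contingency numbers over the phonetic elements.
    probs = {}
    for g, row in counts.items():
        total = sum(row.values())
        for p, k in row.items():
            probs[(g, p)] = k / total
    return probs
```

Applied to the single transcription (beau|bo), this sketch associates b and e with [b] and a and u with [o], in line with the rank correlation described above.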

Referring again to FIG. 1, the matching process
continues with steps E3 to E10, which segment each graphic
chain CG read in the corpus of the database in order to
match automatically and on a biunivocal basis each
segment of the chain CG, called a grapheme, comprising
one or more graphic elements, to a segment, called a
phoneme, comprising one or more phonetic elements
resulting from segmentation of the corresponding phonetic
chain CP.

A graphic chain CG comprises M consecutive graphic
elements g_1 to g_M and the phonetic chain CP corresponding
to the chain CG comprises N consecutive phonetic elements
p_1 to p_N. The integer N may be different from or equal to
the integer M.

The probability P(g_1, ..., g_m, ..., g_M | p_1, ..., p_n, ..., p_N)
that the chain CG matches the chain CP, where 1 ≤ m ≤ M and
1 ≤ n ≤ N, is determined as a function of the first
elementary transcription probabilities P(g_i|p_j) estimated
and stored beforehand in the step E2 and from the similarity
between the chains CG and CP. The similarity is based on
the Damerau-Levenshtein Metric (DLM) but using
maximization instead of minimization. The probability
P(CG|CP) is determined by dynamic programming using the
following iterative formula for any pair m, n such that
1 < n ≤ N and 1 < m ≤ M:

$$P(g_1 g_2 \ldots g_m \mid p_1 p_2 \ldots p_n) = P(g_m \mid p_n) \cdot \max\bigl[\,
P(g_1 g_2 \ldots g_{m-1} \mid p_1 p_2 \ldots p_n),\;
P(g_1 g_2 \ldots g_m \mid p_1 p_2 \ldots p_{n-1}),\;
P(g_1 g_2 \ldots g_{m-1} \mid p_1 p_2 \ldots p_{n-1}) \,\bigr]$$

The concatenated nature of the global chain
transcriptions and the grapheme/phoneme transcriptions
means that Markov models may be applied effectively.
For the given probability of transcription of a chain
g_1 g_2 ... g_m into a chain p_1 p_2 ... p_n, the extension of
the graphic, respectively phonetic, chain by a new graphic
element g_{m+1}, respectively phonetic element p_{n+1}, gives
rise either to the same phonetic chain, respectively
graphic chain, or to the addition of a new phonetic
element, respectively graphic element. Expressed in terms
of probability, P(g_1 g_2 ... g_{m+1} | p_1 p_2 ... p_{n+1})
depends only on the probabilities of three possible
transcriptions:

P(g_1 g_2 ... g_m | p_1 p_2 ... p_{n+1}),

P(g_1 g_2 ... g_{m+1} | p_1 p_2 ... p_n),

P(g_1 g_2 ... g_m | p_1 p_2 ... p_n).

That dependency is expressed by the DLM metric, equal to
the highest of the above three probabilities.

After setting the indices m and n to zero for a
global transcription (CG|CP) in the step E3 and
incrementing the indices m and n by 1 in the steps E4 and
E5, iterations in the steps E6 and E7 begin by
determining the probabilities that the M successive
concatenations of the graphic elements g_1 to g_M of the
chain CG match the first phonetic element p_1 of the chain
CP, i.e.:

$$P(g_1, \ldots, g_m \mid p_1) = P(g_m \mid p_1) \cdot \max\bigl[\, P(g_1, \ldots, g_{m-1} \mid p_1) \,\bigr]$$

where 1 ≤ m ≤ M, and starting with the elementary
probability P(g_1|p_1). As shown by the step E8, the
process then determines by iteration the probabilities of
the M concatenations of the graphic elements g_1 to g_M of
the chain CG matching the first two phonetic elements p_1
and p_2 of the chain CP using the probabilities previously
determined for the first phonetic element p_1, i.e.:

$$P(g_1, \ldots, g_m \mid p_1, p_2) = P(g_m \mid p_2) \cdot \max\bigl[\,
P(g_1, \ldots, g_{m-1} \mid p_1, p_2),\;
P(g_1, \ldots, g_m \mid p_1),\;
P(g_1, \ldots, g_{m-1} \mid p_1) \,\bigr].$$

Substitute Specification-Clean Copy

The process then continues by adding a phonetic
element p_n to determine the M probabilities
P(g_1 | p_1, ..., p_n) to P(g_1, ..., g_M | p_1, ..., p_n), up to
the M probabilities relating to the chain
CP = (p_1, ..., p_N). By iteration of
the steps E4 to E8, the computer progressively constructs
and stores a matrix of second probabilities
P(g_1, ..., g_m | p_1, ..., p_n) with M columns for successive
concatenations of the M graphic elements and N rows for
successive concatenations of the N phonetic elements,
operating row by row as in the above example, beginning
with the probability P(g_1|p_1) and ending with the
probability P(g_1, ..., g_M | p_1, ..., p_N).
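
The row-by-row construction of the matrix of second probabilities may be sketched as follows in Python, purely as an illustration. The sketch assumes that the first probabilities are supplied as a dictionary keyed by (graphic element, phonetic element), that a pair absent from the dictionary has probability zero, and that ties between predecessors are resolved in the order left, up, diagonal. The function name second_probabilities and the link labels "left", "up" and "diag" are hypothetical.

```python
def second_probabilities(cg, cp, first_prob):
    """Fill the matrix of second probabilities, N rows by M columns.

    Returns (matrix, links); links[n][m] names the predecessor cell
    ("left", "up" or "diag") retained for cell (n, m), or None for (0, 0).
    """
    M, N = len(cg), len(cp)
    matrix = [[0.0] * M for _ in range(N)]
    links = [[None] * M for _ in range(N)]

    for n in range(N):            # one row per phonetic element
        for m in range(M):        # one column per graphic element
            local = first_prob.get((cg[m], cp[n]), 0.0)

            # Collect the predecessors that exist for this cell.
            candidates = []
            if m > 0:
                candidates.append((matrix[n][m - 1], "left"))
            if n > 0:
                candidates.append((matrix[n - 1][m], "up"))
            if m > 0 and n > 0:
                candidates.append((matrix[n - 1][m - 1], "diag"))

            if candidates:
                # Keep the highest predecessor; the first one wins a tie (assumed).
                best, link = max(candidates, key=lambda c: c[0])
                matrix[n][m] = local * best
                links[n][m] = link
            else:
                # First cell: the elementary probability P(g_1|p_1).
                matrix[n][m] = local
    return matrix, links
```

The links are returned alongside the probabilities because each iteration both determines a probability and records which of the three preceding transcriptions was retained.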

Each iteration relating to the (m.n)th
transcription [(g_1, ..., g_m) | (p_1, ..., p_n)] establishes a
link between the pair (g_m, p_n) and the pair with the highest
of the three probabilities determined beforehand for the
three pairs (g_{m-1}, p_n), (g_m, p_{n-1}) and (g_{m-1}, p_{n-1}).
The link is stored in the computer. If the pair (g_m, p_n) is
linked to the pair (g_{m-1}, p_n), it is an elementary
transcription from (g_{m-1}, g_m) to p_n; if the pair
(g_m, p_n) is linked to the pair (g_m, p_{n-1}), it is an
elementary transcription from g_m to (p_{n-1}, p_n); if the
pair (g_m, p_n) is linked to the pair (g_{m-1}, p_{n-1}), it
is an elementary transcription from g_m to p_n.

Thus a link is stored in the computer for each
determination of a probability P(g_1, ..., g_m | p_1, ..., p_n).
The links trace a single path that is also stored
progressively in the computer and links the first pair
(g_1, p_1) to the last pair (g_M, p_N) in the matrix with M
columns and N rows. The topology of the single path in
the MxN matrix segments the graphic chains CG into
graphemes and the phonetic chains CP into phonemes and
aligns the graphic elements and the phonetic elements in
biunivocal correspondence. If a segment of the path
follows a portion of a row between two graphic elements,
the concatenation of the graphic elements of that row
portion corresponds to the phonetic element of the row
completed by one or more mute and meaningless phonetic
elements in order to form a grapheme and phoneme pair
that has the same number of elements and is stored in the
computer. If a segment of the path follows a column
portion between two phonetic elements, the graphic
element of the column plus one or more meaningless
graphic elements corresponds to the concatenation of the
phonetic elements of that column portion in order to form
a grapheme and phoneme pair that has the same number of
elements and is stored in the computer. A change of
direction of the path in the matrix towards the
horizontal, the vertical or the diagonal indicates
segmentation of the chains CG and CP.
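
One possible way of reading the segmentation off the stored links is sketched below in Python. It is a simplified illustration, assuming that the links are given as an N-row by M-column table of the hypothetical labels "left", "up" and "diag" (None for the first pair), that horizontal and vertical links extend the current grapheme and phoneme pair, and that a diagonal link closes it. The function name segment_from_links is hypothetical, and the character * is used as the mute element.

```python
def segment_from_links(cg, cp, links, mute="*"):
    """Follow the links from the last pair back to the first pair.

    Returns a list of (grapheme, phoneme) strings of equal length.
    """
    m, n = len(cg) - 1, len(cp) - 1
    grapheme, phoneme = [cg[m]], [cp[n]]
    pairs = []

    while links[n][m] is not None:
        link = links[n][m]
        if link == "left":
            # Row portion: one more graphic element for the same phoneme.
            m -= 1
            grapheme.insert(0, cg[m])
        elif link == "up":
            # Column portion: one more phonetic element for the same grapheme.
            n -= 1
            phoneme.insert(0, cp[n])
        else:
            # Diagonal link: close the current pair and start a new one.
            pairs.append((grapheme, phoneme))
            m, n = m - 1, n - 1
            grapheme, phoneme = [cg[m]], [cp[n]]

    pairs.append((grapheme, phoneme))
    pairs.reverse()

    # Pad the shorter side with mute elements so both sides have equal length.
    aligned = []
    for g, p in pairs:
        width = max(len(g), len(p))
        aligned.append(("".join(g) + mute * (width - len(g)),
                        "".join(p) + mute * (width - len(p))))
    return aligned
```

The padding step reflects the requirement that a grapheme and its phoneme contain the same number of elements.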

A simple example concerns seeking to segment the
global transcription of the word CG = "beau" into the
phonetic chain CP = [bo] on the assumption that the step
E2 estimated the following first individual probabilities
in the corpus C:

P(b|b)=0.9 ; P(e|b)=0.1 ; P(a|b)=0.1 ; P(u|b)=0.1
P(e|o)=0.2 ; P(a|o)=0.1 ; P(u|o)=0.2 ; P(b|o)=0.1.

For the transcription (beau|bo) from the corpus,
the M=4 iterations of the steps E5, E6 and E7 for each of
the N=2 rows of the 4x2 matrix produce the following
table:



p_n / g_m    b = g_1    e = g_2    a = g_3    u = g_4
[b] = p_1    0.9        ← 0.09     ← 0.009    ← 0.0009
[o] = p_2    ↑ 0.09     ↖ 0.18     ← 0.018    ← 0.0036

The symbol ← indicates that the pair (g_m, p_n) is
linked to the pair (g_{m-1}, p_n); the symbol ↑ indicates
that the pair (g_m, p_n) is linked to the pair (g_m, p_{n-1});
and the symbol ↖ indicates that the pair (g_m, p_n) is
linked to the pair (g_{m-1}, p_{n-1}). The symbol ↖ associated
with the transcription (be|bo) indicates that the latter
has been derived from, and is therefore linked to, the
preceding transcription (b|b). The symbol ↖ indicates a
segmentation boundary between grapheme and phoneme pairs.
The following alignment is derived from this table:

b   eau
b   o**.

The symbol * designates a mute and meaningless phonetic
element.
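
For verification, the entries of the table can be recomputed from the first probabilities listed above by applying the iterative formula cell by cell, the three terms of each maximum being taken in the order left, upper and diagonal neighbour. The derivation below is a worked check under that reading; the third entry of the first row evaluates to 0.009, which is consistent with the fourth entry 0.0009.

```latex
\begin{align*}
P(b \mid b)     &= 0.9 \\
P(be \mid b)    &= P(e \mid b)\,P(b \mid b)   = 0.1 \times 0.9   = 0.09 \\
P(bea \mid b)   &= P(a \mid b)\,P(be \mid b)  = 0.1 \times 0.09  = 0.009 \\
P(beau \mid b)  &= P(u \mid b)\,P(bea \mid b) = 0.1 \times 0.009 = 0.0009 \\
P(b \mid bo)    &= P(b \mid o)\,P(b \mid b)   = 0.1 \times 0.9   = 0.09 \\
P(be \mid bo)   &= P(e \mid o)\,\max[0.09,\ 0.09,\ 0.9]      = 0.2 \times 0.9   = 0.18 \\
P(bea \mid bo)  &= P(a \mid o)\,\max[0.18,\ 0.009,\ 0.09]    = 0.1 \times 0.18  = 0.018 \\
P(beau \mid bo) &= P(u \mid o)\,\max[0.018,\ 0.0009,\ 0.009] = 0.2 \times 0.018 = 0.0036
\end{align*}
```

The only diagonal link on the path is the one retained for (be|bo), whose maximum 0.9 is the diagonal term P(b|b), which is why the boundary falls between b and eau.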

To perfect the matches between graphemes and
phonemes and the matches between graphic elements and
phonetic elements, preferably in the manner indicated by
the step E11, the first probabilities P(g_i|p_1) to P(g_i|p_J)
of the transcriptions of each of the graphic elements
respectively into the J phonetic elements (step E2) and
in particular the contingency numbers K_{g_i p_1} to K_{g_i p_J}
(substep E25) are again estimated as a function in
particular of the ranks of the phonetic elements placed
in the given phonetic chains CP that were segmented into
phonemes in the preceding step E10. Second probabilities
P(g_1, ..., g_m | p_1, ..., p_n) of MxN second transcriptions of
each global transcription of a given graphic chain (CG) with
M graphic elements into a corresponding phonetic chain
(CP) with N phonetic elements are determined by executing
the steps E3 to E10 in order for links to be established
in the next step E10 between pairs (g_m, p_n) of a new
matrix with M columns and N rows and consequently for a
corrected path to link the last pair (g_M, p_N) to the first
pair (g_1, p_1) in the new MxN matrix of second
probabilities.

Thanks to the processing capacity and high
processing speed of the computer, other iterative loops
of steps E2 to E11 may be executed in the computer until
the matching process converges, i.e. until the path
established becomes constant from one loop to the next.
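
The iterative loops may be sketched as follows in Python. This is a tentative illustration only: the description does not spell out the re-estimation in detail, so the sketch assumes that the rank relationship of the substep E23 is reapplied inside each grapheme and phoneme pair obtained in the preceding loop, with mute elements ignored. The names reestimate and iterate_until_stable, and the align callable that is expected to return the list of (grapheme, phoneme) pairs for one transcription, are all hypothetical.

```python
from collections import defaultdict


def reestimate(alignments, mute="*"):
    """Re-estimate P(g|p) from segmented transcriptions (assumed scheme)."""
    counts = defaultdict(lambda: defaultdict(int))
    for pairs in alignments:
        for grapheme, phoneme in pairs:
            g_seg = grapheme.replace(mute, "")
            p_seg = phoneme.replace(mute, "")
            for rank_g, g in enumerate(g_seg):
                # Same rank relationship as E23, restricted to one segment.
                rank_p = (rank_g * len(p_seg)) // len(g_seg)
                counts[g][p_seg[rank_p]] += 1
    probs = {}
    for g, row in counts.items():
        total = sum(row.values())
        for p, k in row.items():
            probs[(g, p)] = k / total
    return probs


def iterate_until_stable(corpus, probs, align, max_loops=20):
    """Repeat alignment and re-estimation until the alignments stop changing."""
    previous = None
    for _ in range(max_loops):
        alignments = [align(cg, cp, probs) for cg, cp in corpus]
        if alignments == previous:
            break  # the paths are constant from one loop to the next
        previous = alignments
        probs = reestimate(alignments)
    return previous, probs
```

The max_loops bound is a safeguard added for the sketch; the description itself states only that loops are repeated until the path becomes constant.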

After segmentation of all the graphic and phonetic
chains of the corpus C into graphemes and phonemes, the
database stores all matches between graphic and phonetic
elements and all matches between graphemes and phonemes
for the whole of the processed corpus C.

Any new graphic chain added to the corpus can then
be transcribed automatically into a phonetic chain
segmented into phonemes, in particular with the aid of
the matches previously established and stored in
accordance with the invention, which progressively
enriches the corpus in the database and increases
transcription accuracy.

As already stated, the phonetic transcriptions are
useful to orthographic error correction software systems
that recognize lexical errors when text is entered on a
terminal keyboard. Thus when the new graphic chain added
to the corpus is being entered on a terminal keyboard,
the phonetic chain segmented into phonemes by means of
the stored matches is used for orthographic correction of
the new graphic chain entered.

The method of the invention may equally well be
used as a tool for automatically generating SMS short
messages from a text written in ordinary language. This
necessitates a training corpus C whose transcriptions
are adapted to the automatic generation of SMS
messages and respectively match graphic chains CG, such
as words and phrases, to phonetic chains CP whose
"phonemes" are phonetically readable by any person who is
not an expert in phonetics. For example, the corpus
establishes the following matches (in French) between
graphic chains and phonetic chains:

j'ai : G
air : R
occupé : OQP
cas : K.

Thus a new graphic chain entered in a terminal is
automatically transcribed by the method of the invention,
by means of stored matches, into a phonetic chain segmented
into phonemes that can be read by any person who is not an
expert in phonetics, to be included in an SMS message.
In the foregoing example, the French phrase "j'ai l'air
occupé" entered on the terminal is transcribed
automatically into the following short message to be
transmitted by the terminal: Gl'ROQP, the "phonetic
chains" [G], [l'], [R] and [OQP] being phonetically
readable by any user who is not an expert in phonetics.
Alternatively, the phonetic chains [G], [l'], [R] and
[OQP] may be treated as phonetic elements to constitute a
phonetic chain [Gl'ROQP].
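
The SMS example may be illustrated by the following Python sketch, which applies the four stored matches listed above by greedy longest-match lookup. The lookup strategy, the function name to_sms and the table name SMS_MATCHES are assumptions made for the illustration; the description does not specify how stored matches are applied to a new chain. Characters with no stored match are copied unchanged and spaces are dropped.

```python
# Matches taken from the example in the description.
SMS_MATCHES = {"j'ai": "G", "air": "R", "occupé": "OQP", "cas": "K"}


def to_sms(text, matches):
    """Transcribe text by replacing the longest stored match at each position."""
    # Try longer graphic chains first; ignore empty keys to guarantee progress.
    keys = sorted((k for k in matches if k), key=len, reverse=True)
    lowered = text.lower()
    out = []
    i = 0
    while i < len(lowered):
        for key in keys:
            if lowered.startswith(key, i):
                out.append(matches[key])
                i += len(key)
                break
        else:
            # No stored match here: keep the character unless it is a space.
            if not lowered[i].isspace():
                out.append(lowered[i])
            i += 1
    return "".join(out)
```

Because the lookup works on raw substrings, a stored match such as "air" would also fire inside longer words; a practical system would presumably restrict matching to segment boundaries.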

The steps of a preferred embodiment of the method
of the invention are determined by instructions of a
computer program incorporated into a computer such as a
terminal, a personal computer, a server or any other
electronic data processing system. The program
automatically matches graphic elements constituting given
graphic chains to phonetic elements constituting
corresponding phonetic chains, after initially entering
global transcriptions of the graphic chains into the
phonetic chains into a database accessible to the
computer and estimating and storing in the database first
probabilities of elementary transcriptions of graphic
elements into respective phonetic elements. The program
includes program instructions which execute the steps of
the method of the invention when said program is loaded
into and executed in the computer, the operation of which
is then controlled by executing the program.

Consequently, the invention applies equally to a
computer program adapted to implement the invention, in
particular a computer program on or in an information
medium. This program may use any programming language and
take the form of source code, object code or an
intermediate code between source code and object code,
such as a partially compiled form, or any other form that
may be desirable for implementing the method of the
invention.